This Markdown file is for the report exercise of Chapter 9 (Supervised Machine learning I) of the AGDS Course. The exercise mostly centers around k-nearest-neighbors models, how they compare to linear regression when predicting GPP data by splitting data into training and test-datasets, and what role the k (number of neighbors) has on the models and how tuning that number influences modelling. This is done in part through visualization and metrics like RMSE and MAE to calculate the goodness of the models and analyze their bias-variance tradeoff.
First, we sort and clean the data.
daily_fluxes <- readr::read_csv("./data/FLX_CH-Dav_FLUXNET2015_FULLSET_DD_1997-2014_1-3.csv") |>
# select only the variables we are interested in
dplyr::select(TIMESTAMP,
GPP_NT_VUT_REF, # the target
ends_with("_QC"), # quality control info
ends_with("_F"), # includes all all meteorological covariates
-contains("JSB") # weird useless variable
) |>
# convert to a nice date object
dplyr::mutate(TIMESTAMP = ymd(TIMESTAMP)) |>
# set all -9999 to NA
dplyr::mutate(across(where(is.numeric), ~na_if(., -9999))) |>
# retain only data based on >=80% good-quality measurements
# overwrite bad data with NA (not dropping rows)
dplyr::mutate(GPP_NT_VUT_REF = ifelse(NEE_VUT_REF_QC < 0.8, NA, GPP_NT_VUT_REF),
TA_F = ifelse(TA_F_QC < 0.8, NA, TA_F),
SW_IN_F = ifelse(SW_IN_F_QC < 0.8, NA, SW_IN_F),
LW_IN_F = ifelse(LW_IN_F_QC < 0.8, NA, LW_IN_F),
VPD_F = ifelse(VPD_F_QC < 0.8, NA, VPD_F),
PA_F = ifelse(PA_F_QC < 0.8, NA, PA_F),
P_F = ifelse(P_F_QC < 0.8, NA, P_F),
WS_F = ifelse(WS_F_QC < 0.8, NA, WS_F)) |>
# drop QC variables (no longer needed)
dplyr::select(-ends_with("_QC"))
Then comes data-splitting and training.
# Data splitting
set.seed(1982)
daily_fluxes <- daily_fluxes[!(colnames(daily_fluxes)) == "LW_IN_F"]
#due to the high amount of NA values in LW_IN_F, and the according to the script is a priori not critical for GPP predictions, we drop this variable. I thought the things done further down with the recipes would fix that by not using that variable in the model but for some reason it didn't work so I do this here removing it by hand
split <- rsample::initial_split(daily_fluxes, prop = 0.7, strata = "VPD_F")
daily_fluxes_train <- rsample::training(split)
daily_fluxes_test <- rsample::testing(split)
# Model and pre-processing formulation, use all variables but LW_IN_F
pp <- recipes::recipe(GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F,
data = daily_fluxes_train |> drop_na()) |>
recipes::step_BoxCox(all_predictors()) |>
recipes::step_center(all_numeric(), -all_outcomes()) |>
recipes::step_scale(all_numeric(), -all_outcomes())
# Fit linear regression model
mod_lm <- caret::train(
pp,
data = daily_fluxes_train |> drop_na(),
method = "lm",
trControl = caret::trainControl(method = "none"),
metric = "RMSE"
)
summary(mod_lm)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1728 -1.0709 -0.1613 0.9008 7.3409
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.21581 0.02446 131.47 <2e-16 ***
## SW_IN_F 1.16519 0.03297 35.34 <2e-16 ***
## VPD_F -0.81775 0.03631 -22.52 <2e-16 ***
## TA_F 1.93384 0.03411 56.70 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.593 on 4238 degrees of freedom
## Multiple R-squared: 0.6645, Adjusted R-squared: 0.6642
## F-statistic: 2798 on 3 and 4238 DF, p-value: < 2.2e-16
# Fit KNN model
mod_knn <- caret::train(
pp,
data = daily_fluxes_train |> drop_na(),
method = "knn",
trControl = caret::trainControl(method = "none"),
tuneGrid = data.frame(k = 8),
metric = "RMSE"
)
#
The plot and evaluation of the linear regression model:
source("./eval_model.R")
#linear regression model
eval_model(mod = mod_lm, df_train = daily_fluxes_train, df_test = daily_fluxes_test)
Figure 1: Evaluation of linear regression model fit vs. GPP measurements
The training and test models are extremely similar (basically identical actually), meaning the trained model does well when it comes to predicting unseen training data.
Doing the same for the k-nearest neighbor model:
# KNN
eval_model(mod = mod_knn, df_train = daily_fluxes_train, df_test = daily_fluxes_test)
Figure 2: Evaluation of k-nearest-neighbor model fit vs. GPP measurements
The R2 and RMSE are better than for the linear regression model, though this time there is a slightly bigger difference in terms of how the model performs on training and test data, but it remains good in terms of model generalisability.
The KNN model could be slightly over- or underfitting for the training data, meaning when applying the model that works for the training data onto the testing data, it is a worse fit due to the slight “wiggle” that the line has. The lm on the other hand is just a fully straight line, which means it doesn’t fit as well in general but at the same time differences between training and testing data do not influence the RMSE much.
Because it is more flexible than “just a straight line” which is the lm.
Linear Regressions have a high bias, as they are not affected by the noise in observations, but its predictions are further away from the observations (just like model poly1 in the chapter) because it oversimplifies the model. KNN can be low bias and high variance when using a k which is too small, or high bias and low variance when using a k which is too big.
This first plot is about the full temporal variation for both models and the measured variable, using training as well as testing data.
colors1 <- c("GPP Measured" = "black", "Linear fit" = "#357cbc", "KNN fit" = "#e41d1d")
ggplot() +
geom_point(data = daily_fluxes, aes(x = TIMESTAMP, y = GPP_NT_VUT_REF, color = "GPP Measured"), alpha = 0.8, size = 1.8) +
geom_line(data = lm_fit_df, aes(x = TIMESTAMP, y = fitted, color = "Linear fit"), alpha = 0.8, size = 1.2) +
geom_line(data = KNN_fit_df, aes(x = TIMESTAMP, y = fitted, color = "KNN fit"), alpha = 0.5, size = 1.1) +
scale_color_manual(values = colors1) +
labs(x = "Year", y = "GPP") +
theme(axis.text = element_text(size = 20), axis.title = element_text(size = 30)) +
theme(legend.key.size = unit(2, "cm"),
legend.text = element_text(size = 20), legend.title = element_blank())
Figure 3: Temporal variations of observed (black) and modelled GPP with KNN (red) and linear regression (blue) for the period 1997-2014
# for training datasets only
plot1 <- ggplot() +
geom_point(data = daily_fluxes_train, aes(x = TIMESTAMP, y = GPP_NT_VUT_REF, color = "GPP Measured"), alpha = 0.8, size = 0.7) +
geom_line(data = daily_fluxes_train, aes(x = TIMESTAMP, y = lm_fitted, color = "Linear fit"), alpha = 0.8, size = 0.6) +
geom_line(data = daily_fluxes_train, aes(x = TIMESTAMP, y = knn_fitted, color = "KNN fit"), alpha = 0.5, size = 0.5) +
scale_color_manual(values = colors1) +
labs(title = "Training data", x = "Year", y = "GPP") +
theme(axis.text = element_text(size = 8), axis.title = element_text(size = 9)) +
theme(legend.key.size = unit(0.5, "cm"),
legend.text = element_text(size = 6), legend.title = element_blank())
#for testing datasets only
plot2 <- ggplot() +
geom_point(data = daily_fluxes_test, aes(x = TIMESTAMP, y = GPP_NT_VUT_REF, color = "GPP Measured"), alpha = 0.8, size = 0.7) +
geom_line(data = daily_fluxes_test, aes(x = TIMESTAMP, y = lm_fitted, color = "Linear fit"), alpha = 0.8, size = 0.6) +
geom_line(data = daily_fluxes_test, aes(x = TIMESTAMP, y = knn_fitted, color = "KNN fit"), alpha = 0.5, size = 0.5) +
scale_color_manual(values = colors1) +
labs(title = "Testing data", x = "Year", y = "GPP") +
theme(axis.text = element_text(size = 8), axis.title = element_text(size = 12)) +
theme(legend.key.size = unit(1.2, "cm"),
legend.text = element_text(size = 8), legend.title = element_blank())
plot_grid(plot1, plot2, nrow = 2)
Figure 4: Temporal variations of observed (black) and modelled GPP with KNN (red) and linear regression (blue) for the period 1997-2014, separated by training and testing data
It is visually quite hard to determine any clear trends through these visualizations. It does seem like the linear model quite consistently go slightly further in the extremes than the knn (on all three graphics the blue lines go further up and down than the red lines), but are still far from being able to predict the actual extreme GPP values.
k = 1 would mean that the line would run through all the training observation, leading to an R2 of 1.00 for the training data, but it would very likely have a bad R2 as well as mean error when applying to the testing dataset (overfitting, high variance).
k = N means a too generalized model, which should perform somewhat poorly in both R2 and MAE, and values for training and test models should be close each other (underfitting, high bias)
The code to test different values of k is best done with a loop that goes through the different values of k we want to examine and does the same training and testing as before, but with every k. With the same principle, we can create correlation plots as well as plots for temporal variation.
#first the loop
plots <- list()
moreplots <- list()
different_ks <- c(1, 2, 5, 8, 15, 50, 150, 300)
knn_diff_k <- list()
all_models <- list()
for (i in 1:length(different_ks)){
knn_diff_k[[i]] <- caret::train(
pp,
data = daily_fluxes_train |> drop_na(),
method = "knn",
trControl = caret::trainControl(method = "none"),
tuneGrid = data.frame(k = different_ks[i]),
metric = "RMSE")
all_models[[i]] <- eval_model(knn_diff_k[[i]], daily_fluxes_train, daily_fluxes_test)
daily_fluxes_train$knn_fitted <- predict(knn_diff_k[[i]], newdata = daily_fluxes_train)
plots[[i]] <- ggdraw(all_models[[i]]) + draw_label(paste("k =", different_ks[i]), x = 0.45, y = 0.96, color = 'red') +
labs(x = "year", y = "GPP")
moreplots[[i]] <- ggplot() +
geom_point(data = daily_fluxes_train, aes(x = TIMESTAMP, y = GPP_NT_VUT_REF), color = 'black', alpha = 0.5) +
geom_point(data = daily_fluxes_train, aes(x = TIMESTAMP, y = knn_fitted), color = '#e41d1d', alpha = 0.5) +
labs(x = "year", y = "GPP")
}
moreplots
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
##
## [[7]]
##
## [[8]]
#Due to cowplot not working when directly writing it directly into the RMarkdown, I put the graphics together in a function plots_draw in a separate file plot_helper.R which will be in the git as well.
source("./plot_helper.R")
plots_draw(plots)
Figure 5: GPP vs. knn model fit and evaluation on both training and test sets for 8 different values on k
These plots are done to compare test and train model fits with the actual GPP measurements, which is done with the same eval_model function as before. With k = 1 it’s very visible how GPP and fit are exactly the same for the train graphic (except the few points where GPP could not be predicted due to NA values) but when comparing it to the test dataset, it is all over the place, signifying an overfitting onto the training dataset and thus low generalization of the fit.
An interesting feature of the plots with higher values for k is the way the cloud of points becomes sort of “flat” at the top and bottom, meaning the fitted values don’t go above a certain threshold. This is due to how, when using 300 neighbors, the fit is basically unable to approximate outliers/more extreme values as more values being taken into account means the fitted values get closer to the average, and if pushed to the extreme using k = n (number of observations), the fitted values would just be a flat line at the level of the average observation.
The RMSE and R-Squared for testing and training data get increasingly closer to each other with higher values of k, and the values for testing data are pretty similar for all values of k between 5 and 300 that are examined here.
#plots
plots_draw(moreplots)
Figure 6: Temporal variation of measured GPP vs. knn model fit for both training and test sets for 8 different values on k (1, 2, 5, 8, 15, 50, 150 and 300)
#for reasons, the titles on these graphs simply didn't work.
The same things can be seen on plots for temporal variation of the GPP: the fitted values with k = 1 are exactly the same as for the measurements, whereas k = 300 gives a kind of wave that has close to the same extreme values for every summer and winter. These two things are clear signs of over- and underfitting.
The last part of the question is about the mean absolute error (MAE). Computing that value in itself is fairly simple thanks to thhe yardstick function “metrics”, but to apply it to different values of k we need to use a function (in order to make it more convenient) as well as a loop.
#function to determine MAE based on k
compute_MAE <- function(k){
knn_diff_k <- caret::train(
pp,
data = daily_fluxes_train |> drop_na(),
method = "knn",
trControl = caret::trainControl(method = "none"),
tuneGrid = data.frame(k = k),
metric = "RMSE")
daily_fluxes_test$fitted <- predict(knn_diff_k, newdata = daily_fluxes_test)
metrics_test <- daily_fluxes_test |>
yardstick::metrics(GPP_NT_VUT_REF, fitted)
MAE_knn_test <- metrics_test |>
filter(.metric == "mae") |>
pull(.estimate)
return(MAE_knn_test)
}
k_numbers <- (c("1", "2", "5", "8", "15", "50", "150", "300"))
MAE_table <- c(compute_MAE(different_ks[1]),
compute_MAE(different_ks[2]),
compute_MAE(different_ks[3]),
compute_MAE(different_ks[4]),
compute_MAE(different_ks[5]),
compute_MAE(different_ks[6]),
compute_MAE(different_ks[7]),
compute_MAE(different_ks[8]))
names(MAE_table) <- k_numbers
knitr::kable(MAE_table, caption = "*Table 1: Mean absolute error on the knn model for 8 different values of k*")
| x | |
|---|---|
| 1 | 1.495793 |
| 2 | 1.303756 |
| 5 | 1.175692 |
| 8 | 1.147647 |
| 15 | 1.123898 |
| 50 | 1.111188 |
| 150 | 1.128479 |
| 300 | 1.153778 |
The MAE values go down for a while with increasing k, then go back up after 50. Finding the optimal k for this model is the objective of the next exercise question.
Just like in the previous exercise, we use the mean absolute error (MAE) on the test dataset, determining it for every k between 0 and 100. This is done via looping and using the same function “compute_MAE” that was used for the last question. The MAEs are then plotted.
all_MAE <- c()
for (i in 1:100){
all_MAE[i] <- compute_MAE(i)
}
min(all_MAE)
## [1] 1.104253
which.min(all_MAE)
## [1] 32
ggplot(
data = data.frame(all_MAE), aes(x = 1:100, y = all_MAE)) +
geom_point() +
geom_point(aes(x = which.min(all_MAE),
y = min(all_MAE), color = 'red')) +
theme(legend.position = "none") +
labs(y = "Mean absolute error", x = "k")
Figure 6: Mean absolute error of fit compared to measurements on the test set for k between 1 and 100
Using the MAE as a metric and the test-datasets for various k, it seems as k = 32 (the red point) is the optimal value for highest generalisability, as it has the lowest Mean Absolute Error for the test data. However, it is hard to come to a simple conclusion using just this somewhat rigid dataset without any re-sampling or other metric.